fix(aws_s3 source): add retry delay in sqs::Ingestor #22999

medzin · 2025-05-06T21:26:01Z

Summary

This commit adds logic to wait 500ms between subsequent calls of run_once if it returns an error. This is required to prevent spamming the SQS API if it fails to receive messages, e.g. due to invalid IAM permissions or an SQS outage.

Change Type

Bug fix
New feature
Non-functional (chore, refactoring, docs)
Performance

Is this a breaking change?

Yes
No

How did you test this PR?

Does this PR include user facing changes?

Yes. Please add a changelog fragment based on our guidelines.
No. A maintainer will apply the "no-changelog" label to this PR.

Notes

Please read our Vector contributor resources.
Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
The CI checks run only after we manually approve them.
- We recommend adding a pre-push hook, please see this template.
- Alternatively, we recommend running the following locally before pushing to the remote branch:
  - cargo fmt --all
  - cargo clippy --workspace --all-targets -- -D warnings
  - cargo nextest run --workspace (alternatively, you can run cargo test --all)
  - ./scripts/check_changelog_fragments.sh
After a review is requested, please avoid force pushes to help us review incrementally.
- Feel free to push as many commits as you want. They will be squashed into one before merging.
- For example, you can run git merge origin master and git push.
If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
run cargo vdev build licenses to regenerate the license inventory and commit the changes (if any). More details here.

References

Closes: No retry backoff for sqs in aws_s3 source #22947

bits-bot · 2025-05-06T21:26:05Z

All committers have signed the CLA.

This commit adds logic to wait 500ms between subsequent calls of run_once if it returns an error. This is required to prevent spamming the SQS API if it fails to receive messages, e.g. due to invalid IAM permissions or an SQS outage.

pront

Thanks @medzin, I understand the problem you are trying to fix here. Left a question.

src/sources/aws_s3/sqs.rs

pront · 2025-06-13T14:25:37Z

src/sources/aws_s3/sqs.rs

+                        Ok(()) => {}
+                        Err(_) => {
+                            trace!("run_once failed, will retry after delay");
+                            tokio::time::sleep(delay).await;


I wonder how this constant was chosen. Would an exponential backoff strategy work better here or it's an overkill?

I was treating this code as a hotfix rather than a long-term solution, so I chose a simple constant delay to quickly address the issue. The current 0.5-second delay doesn't introduce significant latency but greatly reduces the number of retries triggered by the source, which helps stabilize things short-term. That said, I agree a constant delay may not be ideal across all use cases - different values might work better for different users. An exponential backoff could offer a more flexible and resilient solution without needing to make the delay configurable. Happy to explore that in a follow-up PR.

I can see how this improves things in your situation but we generally don't want to merge hotfixes. If you have urgent need for this commit, you can always maintain a fork with this commit and use a custom Vector build.

Going back to this PR, what do you think about introducing a new opt-in config parameter here? We can probably reuse ExponentialBackoff from src/sinks/util/retries.rs.

Example:

1st retry: 1 second

2nd retry: 2 seconds

3rd retry: 4 seconds

4th retry: 8 seconds

5th retry: 16 seconds

6th retry: 30 seconds (capped)

Sure. I will check it.

pront · 2025-10-13T14:00:43Z

Couldn't commit to this branch due to perms, moved to: #23996

medzin requested a review from a team as a code owner May 6, 2025 21:26

github-actions bot added the domain: sources Anything related to the Vector's sources label May 6, 2025

medzin force-pushed the master branch from 89e2f1a to 29e4bef Compare May 6, 2025 21:33

fix(aws_s3 source): add retry delay in sqs::Ingestor

ba6e8c0

This commit adds logic to wait 500ms between subsequent calls of run_once if it returns an error. This is required to prevent spamming the SQS API if it fails to receive messages, e.g. due to invalid IAM permissions or an SQS outage.

medzin force-pushed the master branch from 29e4bef to ba6e8c0 Compare May 6, 2025 21:37

pront reviewed Jun 13, 2025

View reviewed changes

pront added the meta: awaiting author Pull requests that are awaiting their author. label Jun 25, 2025

Merge branch 'vectordotdev:master' into master

4e0d674

github-actions bot removed the meta: awaiting author Pull requests that are awaiting their author. label Jul 9, 2025

pront added the meta: awaiting author Pull requests that are awaiting their author. label Jul 9, 2025

pront force-pushed the master branch 4 times, most recently from 1720078 to ffe54be Compare July 10, 2025 15:43

pront closed this Oct 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(aws_s3 source): add retry delay in sqs::Ingestor #22999

fix(aws_s3 source): add retry delay in sqs::Ingestor #22999

Uh oh!

medzin commented May 6, 2025

Uh oh!

bits-bot commented May 6, 2025 •

edited

Loading

Uh oh!

pront left a comment

Uh oh!

Uh oh!

pront Jun 13, 2025

Uh oh!

medzin Jun 13, 2025 •

edited

Loading

Uh oh!

pront Jun 13, 2025

Uh oh!

medzin Jun 13, 2025

Uh oh!

pront commented Oct 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix(aws_s3 source): add retry delay in sqs::Ingestor #22999

fix(aws_s3 source): add retry delay in sqs::Ingestor #22999

Uh oh!

Conversation

medzin commented May 6, 2025

Summary

Change Type

Is this a breaking change?

How did you test this PR?

Does this PR include user facing changes?

Notes

References

Uh oh!

bits-bot commented May 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pront left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pront Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

medzin Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pront Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

medzin Jun 13, 2025

Choose a reason for hiding this comment

Uh oh!

pront commented Oct 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

bits-bot commented May 6, 2025 •

edited

Loading

medzin Jun 13, 2025 •

edited

Loading